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Abstract 

We are interested in comparing probability distributions denned 
on Riemannian manifold. The traditional approach to study a dis- 
tribution relies on locating its mean point and finding the dispersion 
about that point. On a general manifold however, even if two distribu- 
tions are sufficiently concentrated and have unique means, a compar- 
ison of their covariances is not possible due to the difference in local 
parametrizations. To circumvent the problem we associate a covari- 
ance field with each distribution and compare them at common points 
by applying a similarity invariant function on their representing matri- 
ces. In this way we are able to define distances between distributions. 
We also propose new approach for interpolating discrete distributions 
and derive some criteria that assure consistent results. Finally, we 
illustrate with some experimental results on the unit 2-sphere. 

1 Introduction 

The problem of comparing distributions denned on a non- Euclidean space 
or to be more specific, a Riemannian manifold, becomes increasingly impor- 
tant. A typical example of non-trivial manifold is the unit 2-sphere § 2 , which 
is the domain of our experiments in this work. In this sense, our study has as 
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main application problems from directional statistics, a branch of statistics 
dealing with directions and rotations in R 3 . 

Pioneers in the field are Fisher, R. A. (1953) and von Mises. In recent 
years directional statistics proved to be useful in variety of disciplines like 
shape analysis [26J, geology, crystallography |2j], bio-informatics [28] and 
data mining (3j. Most of the practitioners in these fields use parametric 
distributions to model directional data, like von Mises-Fisher distribution 
and Fisher-Bingham- Kent (FBK) distributions. 

There are application areas however, where parametric models are insuf- 
ficient. A recent example is provided by medical imaging community. In 
a new technique based on MRI and called High Angular Resolution Diffu- 
sion Imaging (HARDI), the data is represented by Orientation Distribution 
Functions (ODFs) which are nothing but discrete distributions on the unit 
2-sphere. These distributions by their nature are multi-modal - they are not 
concentrated about a particular direction. They do not follow a parametric 
model and even if they do the eventual model would be too complicated to 
be efficient. Consequently, a non-parametric approach is more natural in 
processing ODFs. 

In analysis of HARDI data researchers first have to solve the problem of 
registration between different volumes of ODFs, corresponding to the images 
of different subjects. For this purpose they need models and algorithms for 
interpolation between ODFs. Second, researchers are interested in comparing 
different groups of subjects using HARDI imaging. Usually, a statistical pro- 
cedure is employed and hypotheses are tested. However, comparison between 
volumes requires comparison between corresponding ODFs and no standard 
method for this is available. A third problem in processing HARDI data 
is building connectivity paths for a given volume. Once again we need a 
consistent way to interpolate between ODFs in order to follow an optimal 
propagating direction. 

There are no many choices for interpolation procedure beyond the sim- 
plest linear one. A recent alternative, using the square root representation 
of probability mass functions, was proposed by Srivastava(2007) and imple- 
mented in [9J. No existing solution though respects the geometry of the 
underlying domain. 

In conclusion, we need more models and non-parametric procedures for 
comparing and interpolating distributions on the sphere and on Riemannian 
manifolds in general. Approaches that address the non- Euclidean nature of 
the random variables and provide adequate solutions. It is the main subject 
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of this paper to draw some new directions for searching of possible solutions. 

What we propose basically is a generalization of the classical concept of 
covariance of distribution. We allow covariance to be defined with respect to 
any point of distribution domain and by doing so we try to workaround the 
problem of finding the mean point, which might not exist or be ambiguous. 
Also, since compact manifolds like S 2 do not admit global parametrizations, 
we pay special attention to use the correct mathematical tool for describing 
the covariance. We not only point out to the well known fact that covariance 
can be viewed as a bi-linear operator and thus defined as a tensor, but specify 
the exact variance of this tensor. It is important to make a distinction be- 
tween covariance tensor and metric tensor on manifold. A central observation 
in our approach is that at any point of the domain, the product of the metric 
and covariance tensors is a linear operator on the respected tangent space. 
We call it covariance operator. Collectively they form a field of operators. 
Then we introduce instruments, the so called similarity invariant functions, 
that can be used to study properties of these fields and to manipulate them. 

After a formal introduction to the concept of covariance operators in 
section 2, we continue in section 3 with motivating examples showing the 
advantages of the new approach. We consider a two-sample location problem 
of the sphere and apply several classical non-parametric tests to solve it. Test 
statistics are based on projections defined by covariance operator fields. 

In section 4 we consider the problem of interpolation between distribu- 
tions on the sphere, and discuss and compare several alternatives. We also 
show some examples of interpolation between ODFs. The results are en- 
couraging in the possibility of developing new applications for processing 
HARDI. 

Although in all our experiments we stay on the unit sphere , the theoret- 
ical framework still holds on a general Riemannian manifold and this is one 
of its main strengths. 

2 Covariance fields 

2.1 Random variables on manifold 

Let M be a Riemannian n-manifold, q G M and let Exp q be the expo- 
nential map at q, Exp q : M q — > M. If M is complete, then the exponential 
map Exp q is defined on the whole tangent space M q . Throughout this paper 
for convenience we will assume that M is a complete Riemannian n-manifold, 
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although often that is not necessary. 

There is a maximal open set U(q) in M p containing the origin, where Exp q 
is a diffeomorphism. Then the set lA(q) = Exp q (U(q)) is called maximal 
normal neighborhood of q. On this normal neighborhood the exponential 
map is invertible and let 

Log q = Exp' 1 : U{q) -> M p 

be its inverse, the so called log-map. Log q is diffeomorphism on lA{q). We 
adopt the notation qp = Log q p in analogy to the Euclidean case, M = M. n , 
where Log q p = p — q = qp. 

In particular, for M = E> n the log-map has a closed-form expression 

_> cos -1 < p,q > . . 
(1— < p, q > 2 ) 1 / z 

which greatly simplifies metric related operations on the unit sphere. 

The Borel sets on M generated by the open sets on M form a cr-algebra 
A(M) on M. Any Riemannian manifold has a natural measure V on A(M), 
called volume measure. In local coordinates x it is given by 



dV{x) = >y\G x \dx, 

where G x is the matrix representation of the metric tensor, \G X \ is its deter- 
minant and dx is the Lebesgue measure in W 1 . 

Example 1 Consider the two sphere, § 2 , parametrized in geographical coor- 
dinates (6, <fi) . Then the metric tensor is represented by 

G ^ = ( cos°(fl) ) (1) 

and the volume form is V(9, <fi) = cos(6)d6d(p. 

A random variable X on M is any measurable function from a probability 
space (fi, B, V) to (M, A(M), V). The distribution function F of X is defined 

as 

F(A) = V(X'\A)), A e A(M). 

If F satisfies 



F(A)= [ f( P )dV(p),VAeA(M), 

J A 



for almost everywhere continuous (w.r.t. V) function /, then F is said to be 
absolute continuous (w.r.t. V) and / is its density (pdf). 
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2.2 Intrinsic and Extrinsic mean and covariance 



Let (M, p) be a metric space. The Frechet mean set of a distribution 
F is the set of minimizers of Q(q) = J p 2 (q,p)dF(p). It was introduced by 
Frechet (1948). If M is a Riemannian manifold M with metric structure 
g, then the intrinsic mean of F, is the Frechet mean of (M,d g ), where d g 
is the geodesic distance. Karcher(1977) considered the intrinsic mean on 
M and gave conditions for its existence and uniqueness. An alternative to 
intrinsic mean is the extrinsic one, which is obtained by embedding M into 
a higher dimensional Euclidean space. We point to the influential paper 
of Bhattacharya R. and Patrangenaru, V. (2005) where the properties of 
extrinsic and intrinsic means and their relation and asymptotic properties 
are considered in details. 

Once a mean point (intrinsic or extrinsic) is specified, the covariance can 
be defined as usual after fixing a coordinate system about that point. 

To compare two distributions one may first look at their intrinsic means. 
If they differ, the distributions differ, otherwise one may compare further 
their covariances at the common mean point. This approach however suffers 
from at least two drawbacks. First, if the population mean set is large, then 
the finite sample intrinsic mean will have substantial variance. That will 
diminish the power of any test for equality of means and more importantly, 
will inevitably require comparing covariances at different points. Second, the 
intrinsic mean, provided it exists and it is unique, and the covariance alone 
do not specify completely a distribution. 

Thus, if we want to answer the problem of comparing distributions, we 
need a more informative structure that completely represents distributions 
and that is defined in coordinates free manner for seamless manipulation and 
comparison. 

2.3 Covariance operators 

Many parametric families of distributions can be defined as functions on 
linear operators. Consider for example the standard normal distribution in 
W 1 with density 



where \i G M n is its mean. Since ||x — fi\\ 2 = tr((x — fj)(x — fi)') and the 
matrix L(x) = (x — p)(x — p)' defines a linear operator 




L(x)(u, v) 



u'L(x)v = [u'(x - n)] [(x - fi))'v], u, v e M n , 
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we can express the density by f(x) oc h(L(x)), h(T) : = exp(— |tr(T)). 

The von Mises-Fisher and FBK distributions [22] on the unit 2-sphere 
give us other such examples. For example, the latter is given by density 



/(x) 
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exp{^7i • x + /?[(7 2 • x) 2 - (73 • x) 2 ]}, 



where 71, 72 and 73 are three points on § 2 representing orthonormal directions 
in M 3 . We have f(x) oc h(Lx(x), L 2 (x), L 3 (x)), where 



In fact, any L : x t— > L(x) in the presented examples is a field of linear opera- 
tors on tangential spaces. The concept we are going to introduce generalizes 
the above observations. 

We return to a general Riemannain manifold M with metric G. Fix a point 
q G M. Recall that the metric G(q) is a co- variant 2-tensor at M q , while 
the quantity (qp)(op)' is a contra- variant 2-tensor at M q . The contraction of 
their tensor product, G(q)(qp)(qp)' , is a (l,l)-tensor, which is nothing but a 
linear operator at M q . 

Now the idea becomes clear. For a distribution F on M, a linear operator 
at M q can be obtained by taking the expectation of G(q)(qp)(qp)' ,p ~ F. 

From now on we will use the standard notation T 2 (M q ) for co- variant 
2-tensors on M q , T 2 (M q ) for contra- variant 2-tensors on M q and T±(M q ) for 
bi-linear operators on M q . 

Definition 1 Let r : M + — > M + be a continuous function. Covariance of 
distribution F on M at point q € M is defined by 



and S : q 1— > G T 2 (M q ) is called covariance field of F. 

With r = 1 we obtain the generic covariance field associated with F and this 
is the default choice. 

As noted above, G(g)E(g) is a linear operator on M q , which we call covari- 
ance operator. Hence, GS is a field of linear operators on M. With respect to 



Li(x) = G(x)(x^)(x^Y, 
are linear operators at the tangent space at x G S 2 and 

h(Ti, T 2 , T 3 ) = exp( K coa(tr(Ti)) + (3[cos 2 {tr{T 2 )) - cos 2 {tr{T 3 ))]). 
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a coordinate system x at q, G(q)Ti(q) is represented by a symmetric and pos- 
itive definite matrix G X T, X , where G x and S x are the representations of G(q) 
and E(g). In other words, GE is a field of symmetric and positive definite 
operators on M. 

If v G M q has components v x with respect to x, we define 



One can check that indeed the last quantity is invariant to coordinate change 
at q. 

It is worth to mention that for a covariance field E on M, E _1 is also 
symmetric and positive definite and when it is differentiable, E _1 introduce 
a new Riemannian metric on M. 

If Ei and E 2 are two covariance fields on M, then E1E2 1 is a field of linear 
operators, i.e. for any q G M, Ei(g)E^" 1 (g) G Tl(M q ). 

On a complete Riemannian manifold, the problem of minimizing the trace 
of the default covariance field is equivalent to the problem of finding the 
intrinsic mean of F, i.e. 



2.4 Similarity invariants 

Let Sym^ denote the space of symmetric and positive definite matrices. 
Since this is the representation domain for covariance operators it is of ob- 
vious importance for us. Sym^ attracted the attention of many researchers 
in the recent years due to its non-Euclidean nature and consequently, the 
variety of research opportunities it provides. For the purposes of Diffusion 
Tensor Imaging, Fletcher, P. T., Joshi, S., (2007) and Pennec. X., Fillard, 
P., Ayache, N (2006) proposed the use of affine invariant distance, while 
ArsignyV., Fillard, P., Pennec X., and Ayache, N. (2007) proposed the so 
called log-Euclidean distance. A good survey of the available distances and 
estimators in Sym+ along with new ones is provided by Dryden, I., Koloy- 
denko, A., and Zhou, D., (2008). We aim a more general treatment of Sym^ 



(G(q)E{q))v : = E x G x v t 



■jc 



and 



<v,(G(q)X(q))v>:=v' x G x Z x G x v., 
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and instead of dealing with specific matrix functions we define a class of in- 
variants. What particular member of this class should be used is application 
problem specific choice. 

Two matrices A,B £ Sym^ are said to be similar if 

A = X~ l BX, for X G GL n . 

Matrix representations of linear operators are similar and thus, this fact holds 
for the representations of GS and EiE^ ■ Next we define an important class 
of functions that respect similarity. 

Definition 2 A similarity invariant function on Sym+ is any continuous 
bi-variate h that satisfies 

(i) h(AXA', AY A') = h(X, Y), VX, Y G Sym+ and A G GL n . 
It is a non-negative with a unique root if 

(ii) h(X, Y) > 0, VX, Y G Sym+ and h(X, Y) = X = Y. 

Moreover, h is called similarity invariant distance, if in addition to (i) and 
(ii) also satisfies 

(m) h(X, Y) + h(Y, Z) > h(X, Z), VX, Y, Z e Sym+. 

Below we list several examples of similarity invariant function we use in our 
experiments. 

1. For a fixed Z G Sym+, the similarity invariant 

h trdif (X, Y-Z) = \ {tr{Z~ l X - Z~ X Y) \ , 

satisfies (iii) but not (ii). Default choice will be Z = G^ 1 , the inverse 
of the metric tensor representation. 

2. The second one is sometimes referred as affine-invariant distance in 
Syrri2, see for example [29], [15], [5], p3] and [30], and it is defined by 

h trln2 (X,Y) = {tr{ln\XY~ l ))y/\X,Y G Sym^ . 

Actually, h tr i n2 is not a unique choice for a distance in Sym^. 
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3. Log-likelihood function gives us another choice for h, 

h lik (X, Y) = tr(XY- v ) - InlXY' 1 ] - n. 
It satisfies (i) and (ii) but it fails to satisfy the triangular inequality. 

4. Another interesting choice for h is 

h lnpr (X,Y) = {/n(tr(XF- 1 )tr(FX- 1 ))} 1 / 2 , 

that satisfies (iii) and 'almost' satisfies (ii): hi npr (X, Y) = -<=>- X = 
cY, for c > 0. 

The concept of covariance fields can be used for measuring the difference 
between distributions on M. Let / and g be two densities on M and £[/] and 
£[<?] be their respected covariance fields. 

For a non- negative h G SIM.(n) we define 

d h (f,g):= [ hmf}(p),n 9 }(p))dV(p). (3) 

J M 

When M is a compact, the above integral is well defined and finite. Moreover, 
if h(X, Y) is a distance function on Sym+, then dh will be a distance in the 
space of densities on M. 

Equation ([3]) gives a very general but impractical way to compare distri- 
butions due to the fact that the integration domain is the whole manifold. 
For application purposes however, one may restrict to a smaller domain or 
perform the comparison on discrete set of points which are of particular in- 
terest. 

3 Two-sample location problem on § 2 

In this section we make an application of covariance operators to non- 
parametric distribution comparison. It will serve more illustration purposes 
rather than strong application ones. The goal is to provide motivating ex- 
amples showing the new opportunities provided by the proposed covariance 
structure. We choose to apply simple procedures, as Wilcoxon signed rank 
and rank sum tests, in order to have a good look and intuition of what 
happens. 
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Let {Pi,i}£Li and {pi,2}iLi, be random samples from distributions F 1 and 
F 2 on S> 2 , respectively, and the two samples be independent of each other. 
Fix a point q G M and define 

Using tensor notation we can write r)\ G T-^S 2 ). The respective sample 
covariance operators at q are 

771 771 

i=i i=i 

We call g observation point and basically, what we are going to show is how 
its choice influences the inference about Fi and F 2 . 

Fix a tangent vector v G § 2 and consider following (ordinary) random 
variables 

g(v) =< v,r)l(v) > q and g(v) =< v,r]f(v) > g , 

where < ., . > q is the dot product in the tangent space § 2 . 

Definition 3 We say that F\ and F 2 have the same location w.r.t. q G § 2 

and write F\ = q F 2 if for any v G § 2 — M, 2 , random variables 

£ l {v) =< v, (G(q)(qXi)(qXi)')(v) > q , Xi ~ F h 1 = 1,2 
have the same median. 

Under the hypothesis H : i*\ = 9 F 2 , for any v, £l(v) and £ 2 (t>) are random 
samples from distributions with equal median. 

To test H we propose two-procedures based on the Wilcoxon signed rank 
test, [17], page 36. Let Tj^wjQl be the signed rank statistics based on £j(i>)'s 
and T X i = m&x{T X i(v), v G R 2 }. Then we reject Hq when T X i is sufficiently 
large. 

The second test is based on T<j, the Wilcoxon signed rank statistics for 
the distances 

d] = tr(rjl(v)) = d 2 (q,p iA ) and d 2 = tr(rft(y)) = d 2 (q,p iA ), 
where d 2 stands for the spherical distance, d 2 (q,p) = cos _1 (< q,p >). 



X T xi = J2i r * s i> where for = - = l{ 2i>0 } and r t are the ranks of \z. t 
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If we choose an orthonormal basis {v s }l =l of § 2 = M? and define 
=< v s, vh's > q and £ >s =< v s ,rfv s > q , 
then the following holds: for any i = 1, m and 1 — 1,2 

2 2 

EC = E < > 2 = ikii 2 = = 4 ( 4 ) 

S=l 8=1 

It is clear that if ^(ui) and £ 2 (t>i) have the same median and ^(^2) and 
£ 2 (t>2) also have the same median, then ^(v) and £ 2 (t>) will have the same 
median for every v. 

{dj}i and {df}i are samples from marginals of F\ and F2, which under 
the null hypothesis have the same location. As distances, they are invariant 
to rotation of the samples pij on the sphere. On the other hand, for any 
I, and {C, l i2 }i follow two marginal distributions that can be consid- 

ered projections of Fi onto two orthogonal axes. As such they form more 
discriminating set of variables than {d[}i. 

These observations motivate the following procedure for testing Hq. 

Test Procedure 1 Let {Pi,j}£Li, 1=1,2 be two random samples, independent 
of each other. 

1. Find the operators 77* and rjf and set 

TTL TTL 

L{q)=L\ q )-L\ q ) = ^Y.^'-Y.^- 

i=l i=l 

2. Let \ s and v s be the eigenvalues and eigenvectors of L(q). 
Settl, =< v s ,rjlv s > q . 

3. Calculate statistics T xi s based on ^ s and set 

max{T xiA ,T xiy2 }. 

4- Choose a significance level a. If pval{T xi ) < a/2^, reject Hq. 
2 We apply Bonferroni correction for the p-value. 
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Figure 1: Testing H : Fi = q F 2 when H is false. Sample examples from 
Fi (red) and -F 2 (blue) are given in the left. Observation point q, shown in 
green, is fixed and equals distribution parameter \x. Top right plot shows T xi 
and Td statistics along with 1 and 5 percentile lines, while the bottom right 
plot shows W X i and Wd by their p- values. Test procedure 1 is run 100 times 
with sample size of 50. T xi clearly outperforms T d statistics by rejecting the 
null hypothesis most of the time and the same is true for W X i versus Wd- 

We also employ the rank sum test (Wilcoxon, Mann and Whitney), [XT] , 
page 106, to compare the performance of £j and di s i random variables. For 
the statistics W X i □ and Wd, we calculate corresponding p- values using large 
sample approximation. The second test procedure is the same as the first 
one but uses W instead of T statistics. 

Note that if Fi distribution is a rotation of F\ about q, then the type II 
error of Td statistics will be 1, i.e. the power will be 0. 

The way of choosing the basic vectors v s of §^ resembles the Principal 
Component Analysis (PCA) of the operator L(q) and its derivatives like 
Principal Geodesic Analysis (PGA), introduced by Fletcher, 2004. In the 
standard setup, PCA is applied on the covariance defined at the (extrinsic 
or intrinsic) mean point. However, not only the existence of a mean is not 
guaranteed, but its properties may not be optimal in the context of the test 
statistic. In contrast, in our approach we allow freedom of choosing the 
observation point q according to a criteria favoring that statistic. 

Figures Q] and [2] show some experimental results using the proposed pro- 

3 W X i = Y^i r »i where for r, are the ranks of in the joint sample {£* }. 
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Figure 2: Performance of T and W statistics when the observation point q 
varies. In the left plot q is chosen uniformly on the sphere. The experiment 
confirms a clear advantage for T X i. In the right plot preference is given to 
those observation points for which tr 2 (L) is large. Now is on a par to 
T xi with both being very high. Corresponding and W X i also have very 
significant p-values. 



cedures for testing H : F\ = q F 2 . We consider a family of distributions given 
by density 

/(PI a, oc exp(-(tr(G(fi)(JIp)(JIp)) 2 - a) 2 ) (5) 

where \i is a fixed point (not to be mistaken as a mean) and a is a parameter. 
Top plots show Wilcoxon sign rank statistics T, while bottom plots show rank 
sum statistics W. As we see in figure [2] left, where the observation point q 
varies uniformly on § 2 , for the majority of positions, T xi and W X i achieve 
higher p-values than and Wd- This result is not isolated and can be 
repeated for a great variety of distributions besides (jSJ). 

How the choice of observation point q affects the relative performance of 
T and W statistics? We have that 

2 m 

d\-d" = " O and - £(£t - = A s , 

s=l i=l 

thus 
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Figure 3: Comparing the performances of T X i and Td statistics (top) for a 
fixed pair of samples by varying the observation point. Samples are drawn 
from (jHJ). Observation points are ordered decreasingly in det(L(q)) in the left 
plot and decreasingly in tr 2 (L(q)) in the right. The bottom plot shows the 
eigenvalues of L(q). We note that Td is the larger statistics and thus, has 
lower p-values, only when both eigenvalues are strictly positive. 

Therefore, if all A s are of equal sign, the absolute value of the sample ex- 
pectation of (dj — df) will be higher than that of (£* s — £? s ), for all s. In 
case when the eigenvalues are of different signs the reverse is expected, the 
absolute value of the sample expectation of (£* s — for the maximal |A S | 
will be higher than that of [d\ — d 2 ), which means that T X i is expected to be 
higher than Td- Of course these considerations are only approximate because 
the tests for T and W statistics are based on assumptions on the medians 
not on the means. Nevertheless, we may take the above as a general obser- 
vation that can be made more formal and rigorous using other appropriate 
statistical tests. 

We provide some experimental evidence confirming the above expecta- 
tions. For comparison the performance of T xi and T d we use det(L(q)). We 
expect for T d to benefit from positive values of det(L(q)) and indeed this 
is the case as seen in figure [3] left. There, for a fixed pair of samples, we 
calculate and compare T xi and Td statistics at 50 observation points on the 
sphere. Then we sort the results such that det(L(q)) decreases. In the far 
left, both Ai and A2 are positive, which leads to a clear advantage for Td- 
Once the sign of det(L(q)) goes negative, the situation reverses. 

We also expect that at observation points with high values of tr 2 (L), all 



14 



statistics to be strong in rejecting a false null hypothesis. Some evidence 
confirming this is shown in figures [2] right and [3] right. tr 2 (L) is probably 
the simplest statistics that measures the difference between the two samples 
and it is in fact, an application of the similarity invariant function htrdif a s 
defined in section 2.2. 

One can show that L is a continuous field of linear operators on S 2 (the 
proof is beyond the scope of the paper). Therefore, if there exists a point q 
with det(L(q)) < 0, then that sign is negative on non-vanishing area. Only 
when samples p^i collectively are highly concentrated, the area S+ where 
det(L(q)) > will dominate over S_, the area where det(L(q)) < 0. In case 
when H is false, we expect that S + < 5_. 

Figure H] gives another useful way to visualize the sample operator L 
at different observation points. By choosing a point q, one can draw the 
projections < v,r)l(v) > q for a set of directions v spanning a circle to obtain 
the so called sample profile. 

In conclusion, choosing an observation point for comparing locations of 
two distributions is an important issue since not all positions provide same 
test performance. Position optimality depends on the statistic applied on the 
covariance operator. For the projection based statistics we used as examples, 
optimal observation points can be chosen by maximizing the squared trace 
of the difference of the covariance operators. 

We also showed that distance based statistics have limited performance 
and in general, employing the whole covariance structure is beneficial. 

We also note that most of the presented results do not depend on the 
specific geometry of the unit sphere and still hold on a general Riemannian 
manifold. 

4 Interpolation of discrete distributions on S 2 

The second application of the covariance operators we are going to con- 
sider is interpolation between discrete distributions on the unit sphere. We 
suppose that the distributions are defined on a common domain - a fixed set 
of points on the sphere. The approach we propose is first, to generate an in- 
terpolated field based on the covariance fields of the initial distributions and 
second, to find a probability mass function which covariance field is close to 
the interpolated one. Closeness is measured using a suitable similarity invari- 
ant function. Covariance fields are also considered discrete ones - they are 
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Figure 4: Two samples of size 50 are drawn from f(p; 0.2) and f(p; 0.3) as 
given by (jSJ). Points g m j n and g max are chosen to minimize and maximize 
tr 2 (Li — L 2 ). Shown are sample profiles and their difference (right) at these 
points, defined by the projections £ =< v,rj(v) > along 50 directions v 
spanning uniformly [0, 2tt]. Profiles of £ x and £ 2 are concentrated and look 
similar at q m in, but are diffused and very different at q m ax- These plots 
visualize clearly the difference between two extreme observation points. 
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defined on a finite set of observation points. With a fixed coordinate system 
at each observation point, not necessarily a global one, the covariance field 
is represented by a set of matrices. As always, we are going to use the tensor 
notation to guarantee a coordinate free approach. 

Let {pi}i = i and {gj}f =1 be two sets of k points on § 2 . The first set is 
the distribution domain. The second one is the observation set. Hereafter, a 
discrete mass function (pmf) is any k-vector /, such that / = {/j = f(pi) > 
0}i = i and Y^h=i fi — 1- We write / G P£ , where P^ denotes the compact 
k-simplex. 

The number of observation points may be in fact less than k, the size of the 
pmfs. However, with a smaller observation set one may lose the uniqueness 
and the continuity of an estimation. Particular geometric configurations 
also lead to the same result and one has to check carefully the consistency 
conditions corresponding to the problem. 

The covariance field of / G P^ at qj is defined as 



S L/]j : = S L/]fe) = ^(q^)(q~jP^)'r(\\qJPi\\)f(Pi 

i=l 

where 

cos" 1 < pi, qj > 



(l-< Pil qj> 2 

We use either r = 1 or 



yj^iPi- < Pi,qj > qj)- 



r(t) = (1 - |) 2 - (6) 

The second choice is known to be optimal on S 2 in the class of functions 
r a{t) = (1 — f ) 2 because it minimizes the maximum of tr(GT,(q)). 
Let / s ,s=l,...,m, be a collection of pmfs and 

be their covariance fields. 

For a non-negative similarity invariant function h, we define 

k 

dh(f, f s ) ■= £ h(E[f]j, q),s = 1, m. (7) 
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For a G P+, i.e. a = {a s }™ =1 , such that a s > and J2 s a s = 1, we define 
the functional 

m 

H(f;a):=J2®sd h (f,f s ). (8) 

Then we formulate the following optimization problem: find a probability 
mass function / such that 

f(a)=argmirifH(f;a). (9) 

Below we show some results regarding the consistency of the estimators 

©• 

Lemma 1 Let h G SIM.(n), a s G P^ and f s G Pj[. If a s — > ao and 
/ s ~~ * f° (i n L2 norm), then 

H(f s ,a s )^H(f°,a ). 

Proof. Observe that 

\H(f s ; a.) - H(f°; a )\ < \H(f; a s ) - H(f; a )\ + \H(f; a ) - H(f°; a )\. 

Since H(f; ao) is continuous in /, the second term above goes to zero. The 
first term is bounded by 

\H(f; a,) - H(f s ; a )| < \\a s - a \\ L2 max/ i (S[/] J , Cf). 

The sets G P^} are compact in Sym^ and h is continuous, therefore 

m&Xj tm h(T,[f]j,Cj) = C < 00 and 

H(f s ;a s )^H(f s ;a ). 

□ 

For a sequence a s , define f s = argminfH(f; a s ). We have the following 

Lemma 2 If h G SXAi(n) and a s — > ao, then 

H(f s ,a s )->H(f°,a ). (10) 
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Proof. Since P£ is a compact any sub sequence of f s has a point of conver- 
gence in P£ . Without loss of generality we may assume that f s —> g G P^. 
Accounting for the minimizing properties of / and applying lemma [1] we can 
write 

H(f°,a s ) > H(f s ,a s ) - H(g,a ) > H(f°,a ). 

Because of H(f°, a,) -> H(f°, a ) we have jTD|). □ 

Unfortunately, (fTUj) is not enough to claim that f s — > /°. However, if 
//"(/; «o) has a well separated minimum at /° we indeed have the wanted 
consistency. 

Corollary 1 f(a) is continuous at all a for which H(f,a>) has a well sepa- 
rated (global) minimum. 

Another problem is how to find the global minimum / of H(f; a), pro- 
vided it is unique. We know that the minimum is easily found in case of 
convex function H, by gradient descent algorithm for example. Moreover, 
the convexity of H(f; cko) m P£ guarantees the well separability of its mini- 
mum and that gives us the desired consistency. 

Proposition 1 If a s — > ct an d h G SXAi(n) is such that H(f; «o) is convex 
in P£ , then 

r - f°. (ii) 

Proof. Suppose the contrary of (ITTjh that there exists g G P£, and sub 
sequence f s — > g, such that ||/° — g\\ > 0. Then H(g;a ) > H(f°;a ) by 
the separability of the minimum. But H(f s ; a s ) — > H(g; a ) by lemma [T] and 
H(f s ; a s ) -> H(f°; a ) by lemma M, which imply a ) = H(f°; a ). The 
contradiction proves the claim. □ 

4.1 Linear Interpolation 

Consider first one of the simplest similarity invariant functions hj rdi f(., .; G~ l ). 
The corresponding optimization functional is 

m k 

8=1 j=l 



19 



Denote = tr{G(q 3 )(qjp\)(qjp\)') = d 2 (qj,Pi) and = tr(G(qj)Cj), then 

m k 

Htrdifif, a) = ^a s ^2(Y1 a ^ ~~ c i) 2 - 

s=l j=l i 

We have 

m k 

8=1 j=l I 



dHtrdif 



The second partial derivatives are 

m k 

8=1 j = l 

Let w = {iUi} e R k , then 

m k k 



9 2 H tr dif 



d 2 H 

WiWi df tr d 7 = 2 Y as ^ a ^ 2 - °- 

i,i s=l j=l i=l 



Therefore, if the matrix A = {a^}^ - =1 is of full rank fc, then H trc nf is 
convex in P^J 1 ". Moreover, the optimal solution of ([9]) satisfies 

Y a ijfi = Y a sCjJ = l,...,k, 

i s=l 

with a unique solution 

m 

8=1 

since for every s and j, £\ a^// = cj. 

Thus, we showed the following 
Proposition 2 // £/ie matrix A has full rank, rank(A) = k, then the linear 
interpolation is the unique solution of the optimization problem |PP for H tr( nf. 

4.2 Non-Linear Interpolations 

Consider similarity invariant function h tr i n 2 and corresponding optimiza- 
tion functional H tr i n2 

m k 

H trln2 (f;a) = ^o^tr^Et/Mqr 1 )). 

8=1 j = l 
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The value of H tr i n2 (f) is small when G(qj)H[f]j is close to covariance op- 
erators G(qj)Cj for all j and s. This is a much stronger condition than 
the requirement for their traces to be close as in the problem of minimizing 
Htrdif- Consequently the minimum of H tr in2(f), in general, will be strictly 
positive and the optimal pmf will be different from the linear interpolation. 

Experiments show great improvement in convergence of gradient descend 
algorithm for problem (jUj) , when instead of the generic covariance one uses 
the second choice 

Define the operators 

^ = (^)(^)'(i-^ [I ) a (q)- 1 

and set Y- = J2i h^ty The gradient of H tr in2 is 

s =i i=i * r( ^ 

The optimization problem (j5J) is solved by gradient descent algorithm, which 
shows relatively fast convergence, unfortunately not always to the global 
minimum, because H tr i n 2{f, &) is not convex in / 6 P^ 1 ". 
Log-likelihood function gives us another choice for H, 

m k 

H lik (f;a) = E^E^KS^Cq)- 1 ) -ZnlS^-Cq)" 1 ! -n} = 

8=1 j=l 

^a a ^2{tr0r/)-ln\Yf\-n}. 

8 = 1 J=l 

The gradient of H iik is 

8=i j=i u \ Zj ijl 

Note that huf. is neither symmetric nor satisfies the triangular inequality, 
but its importance is determined by the relation to normal distributions and 
its analytical properties. Define the matrix 

B = {bij = {d{qj,pi) - |) 2 }Jij = i. 
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Proposition 3 If B has full rank, rank(B) = k, then for all a, Huk(f; a) 
is a convex function in P£. 

Proof. We have 



and 



dfi 

Jl 8=1 j=l 



o2rr m 

O nuk 



dfM 

We want to show that the matrix of second partial derivatives is positive 
definite. Let w = {wi} £ IR fc and w^O, then 

a2 tt m k k 

E ^>am = E «. E ir E - wr 1 )' > o. 

i,j 1/1 «=i i=i i=i 

since by the assumption for 5, for at least one j, Yli=i w i^tj 7^ 0. □ 

The rank of B can be calculated using the pairwise distances between q 
and p points and only in very special circumstances this rank will be less 
than k. More formally, if a random process chooses the points, then 

P(rank(B) < k) = 0. 



4.3 Examples and conclusions 

Figure [5] shows interpolation between two pmfs of size 6 (m = 2, k = 6) 
applying h tr i n 2. We compare it to the linear and the square root interpo- 
lations. Square root interpolation, as suggested by the name, relies on the 
observation that for a pmf f £ P^, y/f = (y/f^, \ff^) £ § fc - Then one 
finds 

V = argmin p& k ^ a sd 2 (p, \JT S ) and sets f sqroot = p 2 . (12) 

s 

It is also informative to compare the Mean-Squared Error (MSE) between 
different interpolations. It is defined by 

2 k 

MSE(f) = J2 a °J2(fi-fi) 2 - ( 13 ) 

8=1 1=1 
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Figure 5: Two examples of interpolation of pmfs on § 2 using h tr i n2 . The 
linear and square root (see (|12|) ) interpolations are also given for reference. 
Top plots show H tr i n2 and Hn^ for the three interpolations. Bottom plots 
show corresponding MSEs (see (JT3"]) ) in the left and FAs (see (fT4|) ) in the 
right. 

Linear and square root interpolations, by their nature, are very close in MSE, 
but very different from f tr i n 2(a), which manifests the non-linear origin of the 
latter. 

Another performance criteria relevant to the study of spherical data is the 
Fractional Anisotropy (FA). Let {Aj}™ =1 be the eigenvalues of ^2 i=1 Pijk'fi, 
where are considered vectors in M 3 (thus FA is defined only for distribu- 
tions on S> 2 ). Then we define 

n n 

FA(f) = {-^- J> - A) 2 / £ A 2 } 1/2 . (14) 

i=l i=l 

Fractional Anisotropy measures a distribution concentration. The higher FA 
the more concentrated it is about particular axes. A uniform distribution has 
FA = 0. As we may expect the linear interpolation substantially reduces the 
FA index, /i^^-based one however, is more conservative and manage to 
sustain higher FA. Preserving the concentration factor is of importance for 
processing ODFs in HARDI, and the empirical evidence for the good FA 
performance of h tr i n 2 is encouraging. 

A second set of examples in figure [6] illustrates interpolation based on the 
likelihood function, huk- As we showed, this choice guarantees the convexity 
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Interpolation u f f + (1-a)f 2 Interpolation otf + (1-a)f 2 





Figure 6: Two examples of interpolation of pmfs on S 2 using huf.. The linear 
and square root interpolations are also given for reference. Top plots show 
for the three alternatives. Bottom plots show corresponding MSEs in 
the left and FAs in the right. 

of Huk and thus the continuity of the optimal solution fuk(ct). 

The likelihood based interpolation fuk exhibits behaviour similar to that 
of ftrin2- Again, it is very distinguished from the linear and the square-root 
one and tends to preserve the anisotropy. 

5 Summary 

The main goal of this article is to introduce covariance operator fields and 
provide some arguments showing their potential and usefulness. 

There is a covariance field associated with any distribution on a Rieman- 
nian manifold. It defines a linear operator on the tangent space of each point 
on the manifold. By applying a similarity invariant to that operator field one 
can obtain a scalar field that represents the distribution. It reveals important 
spatial characteristics of the distribution. Similarity invariants can also be 
used for comparing and interpolating distributions. 

We demonstrated several non-parametric procedures for solving a two- 
sample location problem on the sphere and showed how covariance operator 
fields can be used for locating observation points that maximize test perfor- 
mance. 

We also implemented two non-linear procedures for interpolating distri- 
butions on the sphere and compared them to the linear and square-root 
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interpolations. The proposed approach is general enough to allow a great 
variety of choices and promises a good application potential. 
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